The main reason for this project is to demonstrate ability to use R to import, clean, transform and present data. By applying our newly gained knowledge for our selected data sets, we are aiming to use it for analysis, model and chart creation for better understanding of data.
For this purpose we focused on the prices of Airbnb listings, burglary rate and average apartment rent prices in the districts of the city of Zurich in 2022.
The main goal is to find out how burglary rate and neighborhood group influences Airbnb rental prices in the districts of Zurich.
For the analysis we used following variables:
library(tidyverse) # data cleaning, sorting, structuring and preparation
library(plotly) # interactive data vizualisation
# Be aware of working directory to import data sets correctly
getwd()
Firstly, all quarterly data is merged together.
tmp_ab_combained <- rbind(ab_listing_0922,
ab_listing_1222,
ab_listing_0323,
ab_listing_0623)
After merging quarterly data in one data set, we select only the columns of our interest.
tmp_ab_combained_1 <- tmp_ab_combained %>%
select(id, name, neighbourhood_group, neighbourhood, latitude,
longitude, room_type, price)
Reviewing combined data frame to inspect for the correct variable classifications and spot any data errors or inconsistencies.
str(tmp_ab_combained_1)
## 'data.frame': 8943 obs. of 8 variables:
## $ id : num 73282 143821 178448 204586 222565 ...
## $ name : chr "Clean, central, quiet" "marvelous LOFT in SIHLCITY Zürich" "a lovely place, top location" "very nice luxury city apartment" ...
## $ neighbourhood_group: chr "Kreis 3" "Kreis 3" "Kreis 2" "Kreis 10" ...
## $ neighbourhood : chr "Sihlfeld" "Alt-Wiedikon" "Enge" "Höngg" ...
## $ latitude : num 47.4 47.4 47.4 47.4 47.3 ...
## $ longitude : num 8.52 8.52 8.53 8.48 8.54 ...
## $ room_type : chr "Entire home/apt" "Entire home/apt" "Private room" "Private room" ...
## $ price : int 100 200 60 200 75 650 148 72 159 210 ...
Since we combined 4 data sets of Airbnb listings, it is quite likely, that in the past year there were listings, which appeared in all quarterly lists for the past year. They must be detected by listings id and eliminated for further analysis.
tmp_ab_no_duplicates <- tmp_ab_combained_1 %>%
distinct(id, .keep_all = TRUE)
Data set contains little information about hotel and shared rooms. In order to make our data for analysis more consistent, we remove these types of listings from our final cleaned data set.
room_type_table <- table(tmp_ab_no_duplicates$room_type)
room_type_table
ab_final <- tmp_ab_no_duplicates %>%
filter(room_type %in% c("Entire home/apt", "Private room"))
head(ab_final)
unique(ab_final$room_type)
# checking the data if there is some missing values
anyNA(ab_final)
Quick final glimpse of cleaned data set.
head(ab_final, n = 1)
## id name neighbourhood_group neighbourhood latitude
## 1 73282 Clean, central, quiet Kreis 3 Sihlfeld 47.37374
## longitude room_type price
## 1 8.51957 Entire home/apt 100
Reviewing structure of data frame to inspect for the correct variable classifications and spot any data errors or inconsistencies.
str(burglary)
## 'data.frame': 7350 obs. of 13 variables:
## $ Ausgangsjahr : int 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
## $ Gemeinde_BFS_Nr : int 21 21 21 131 131 131 241 241 241 1 ...
## $ Gemeindename : chr "Adlikon" "Adlikon" "Adlikon" "Adliswil" ...
## $ Stadtkreis_BFS_Nr : int NA NA NA NA NA NA NA NA NA NA ...
## $ Stadtkreis_Name : chr "" "" "" "" ...
## $ Gesetz_Nummer : int 311 311 311 311 311 311 311 311 311 311 ...
## $ Gesetz_Abk : chr "StGB" "StGB" "StGB" "StGB" ...
## $ Tatbestand : chr "Einbruchdiebstahl" "Einschleichdiebstahl" "Einbrüche insgesamt" "Einbruchdiebstahl" ...
## $ Straftaten_total : int 7 0 7 159 33 192 10 2 12 3 ...
## $ Straftaten_vollendet: int 6 0 6 114 32 146 7 2 9 3 ...
## $ Straftaten_versucht : int 1 0 1 45 1 46 3 0 3 0 ...
## $ Einwohner : int 569 569 569 16052 16052 16052 987 987 987 1700 ...
## $ Häufigkeitszahl : num 12.3 0 12.3 9.9 2.1 12 10.1 2 12.2 1.8 ...
Filtering for year 2022, city of Zurich, which municipality’s number is 261 and category of total crime in the district.
burglary_final <- burglary %>%
select(Year, MunicipalityBFSNr, DistrictName, FactsEnglish,
OffencesTotal, Residents) %>%
filter(Year == 2022,
MunicipalityBFSNr == 261,
FactsEnglish == "TotalCrime",
grepl("^Kreis", DistrictName) ) %>%
mutate("BurglariesPerResident" = OffencesTotal/Residents) %>%
arrange(desc(BurglariesPerResident))
Quick final glimpse of cleaned data set.
head(burglary_final, n = 1)
## MunicipalityBFSNr DistrictName FactsEnglish OffencesTotal Residents
## 1 261 Kreis 1 TotalCrime 153 5411
## BurglariesPerResident
## 1 0.02827573
Reviewing structure of data frame to inspect for the correct variable classifications and spot any data errors or inconsistencies.
str(rent)
Selecting only the column that we need for our analysis.
rent_final <- rent %>%
select(RaumeinheitLang, GliederungLang, ZimmerSort, GemeinnuetzigLang,
EinheitLang, PreisartLang, mean) %>%
filter(grepl("^Kreis", GliederungLang),
GemeinnuetzigLang == "Alle Wohnungen",
RaumeinheitLang == "Stadtkreise",
PreisartLang == "Netto",
EinheitLang == "Wohnung") %>%
group_by(GliederungLang) %>%
summarise(avg.price = round(mean(mean), 0)) %>%
rename(neighbourhood_group = GliederungLang) %>%
arrange(desc(avg.price))
Quick final glimpse of cleaned data set.
head(rent_final, n = 1)
## # A tibble: 1 × 2
## neighbourhood_group avg.price
## <chr> <dbl>
## 1 Kreis 1 2317
We have three cleaned and trimmed data sets. Left_join function is used to merge them together by the name of the district.
master_final <- ab_final %>% left_join(burglary_final,
by = "neighbourhood_group") %>%
left_join(rent_final, by = "neighbourhood_group")
# write.csv(master.final, file = "master.final.csv", row.names = FALSE) #NEW
# Quick check
head(master_final)
## id name neighbourhood_group
## 1 73282 Clean, central, quiet Kreis 3
## 2 143821 marvelous LOFT in SIHLCITY Zürich Kreis 3
## 3 178448 a lovely place, top location Kreis 2
## 4 204586 very nice luxury city apartment Kreis 10
## 5 222565 Bedroom overlooking the lake near Kreis 2
## 6 227039 *Luxury Penthouse in the heart of trendy Zurich* Kreis 5
## neighbourhood latitude longitude room_type price MunicipalityBFSNr
## 1 Sihlfeld 47.37374 8.51957 Entire home/apt 100 261
## 2 Alt-Wiedikon 47.35724 8.52304 Entire home/apt 200 261
## 3 Enge 47.36565 8.52753 Private room 60 261
## 4 Höngg 47.40656 8.48465 Private room 200 261
## 5 Wollishofen 47.33463 8.54117 Private room 75 261
## 6 Escher Wyss 47.38942 8.51881 Entire home/apt 650 261
## FactsEnglish OffencesTotal Residents BurglariesPerResident avg.price
## 1 TotalCrime 304 48865 0.006221222 1564
## 2 TotalCrime 304 48865 0.006221222 1564
## 3 TotalCrime 251 35494 0.007071618 1803
## 4 TotalCrime 180 39719 0.004531836 1614
## 5 TotalCrime 251 35494 0.007071618 1803
## 6 TotalCrime 194 15331 0.012654100 1801
The violin plot below shows the relationship of neighborhood group to price per night in CHF for two groups entire home/apartment and private room. For the group entire home/apartment the shape of Kries 3,7,8 ( skinny on each end and wide in the middle) indicates the price of this type of accommodation located around the median. The box plot elements show the median of Kries 12 is lower than for another neighborhood group. Regarding the private room, we have a distribution shape around the median for Kries 2,4,6,10. Kries 10 has the lowest median price. Focusing on the median value, entire home/apartment has a higher median than private room for all neighborhood groups. This is because entire home/apartment are generally more expensive than private room. Also in this case, we use the libraries “ggplot2” and “plotly”. In addition, the type of graph that we’ve selected is the box plot and violin plot. The reason that shows us the distribution of numerical data and skewness. In particular, with the interactive graph, it’s possible to visualize the median, quartiles, and outliers.
Across all districts, the dominant type of accommodation offered on Airbnb is entire home/apartment. Kries 3 and 4 possess the highest percentage of entire home/apartment while Kries 12 has the smallest share of such a type of accommodation. On the other hand, Kries 4 offers the most private rooms for rent, while Kries 12 offers the fewest. This observation can be explained by the fact that Kries 3 and 4 may be close to the city center or that this district is located in a touristic accommodation area. However, Kries 12 may be located in the inner city of Zurich; it’s more of a residential area than a tourist area.
The data in the plot illustrates that Langstrasse has the highest number of accommodations, followed by Altstetten and Sihlfeld. These neighborhoods correspond to Kies 4, 9, and 3, respectively. In the other direction, Saatlen, which belongs to Kries 12, has the lowest number of accommodations.
The burglary rate per neighborhood varies from 0.3 to almost 3 per hundred residents in the area. District number 1 has the highest burglary rate with almost 3 burglaries per 100 residents, which is far above the average burglary rate in other neighborhoods. It could be explained by the fact that in this district the central train station is located, which tends to attract all sorts of crowds, as well as the infamous Langstrasse street, which is packed with special entertainment and activities.
We aim to conduct a basic linear model analysis to assess whether alterations in burglary rates have a substantial impact on property prices. In essence, we seek to determine if changes in price are linked to burglaries.
##
## Call:
## lm(formula = price ~ BurglariesPerResident, data = less_than_500)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117.54 -56.73 -20.00 34.72 362.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 129.663 2.699 48.043 < 2e-16 ***
## BurglariesPerResident 1339.426 260.358 5.145 2.88e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.13 on 2679 degrees of freedom
## Multiple R-squared: 0.009783, Adjusted R-squared: 0.009413
## F-statistic: 26.47 on 1 and 2679 DF, p-value: 2.875e-07
According to the model, we have a significant relationship between burglary rates and property prices. To rephrase it, it would mean that variations in safety levels, particularly in terms of burglary incidents, have an impact on property pricing.
As the plot shows, the distribution of the point does not follow a linear dispersion. Most likely, aiming for another model will be more adequate.
After all these considerations, we conclude that burglaries Per Resident and price have no relation to causality. Every price depends on other parameters, for example, location and type of accommodation. As a result, the price also depends on it. Moreover, we want to specify an aspect related to the price that will appear frequently in our data. Some prices are very high; they don’t reflect reality, so we decided to remove them.
First of all, we want to perform an ANOVA test to evaluate if changing the burglaries and neighborhood group significantly affects the price. In other words, we want to measure if price variation is influenced by location and safety level.
## Df Sum Sq Mean Sq F value Pr(>F)
## (Intercept) 1 96972530 96972530 521.692 < 2e-16 ***
## BurglariesPerResident 1 3282863 3282863 17.661 2.72e-05 ***
## neighbourhood_group 10 2093493 209349 1.126 0.338
## Residuals 2832 526414770 185881
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary result shows that the ANOVA model is significant for burglaries per resident at a significance level of alpha = 0.05, but not for neighborhood group. In fact, there is a main effect of burglaries per resident on the price, F(1,2832), p = 2.72e-05. However, there is no main effect of neighborhood group on the price, F(1,2832), p = 0.338.
For the first part of chapter of choice, we decided to extract some information from the column “name”, which containes mixed information about the particular listing. Firstly, we spot the pattern, that was repetitive in the column, then with the help of regex we extracted information and enriched our data set with the information of how many bedrooms, beds and bathrooms are in each listing. That was done only with the data of most recent listings data and was not used for further data exploration.
for (i in 1:nrow(ab_listing_0623)){
row_words <- unlist(str_split(ab_listing_0623[i, 2], " ")) # Split second
# column's information by space
bedroom_idx <- which(str_detect(row_words, "^bedr")) # extract words with bedr
beds_idx <- which(str_detect(row_words, "^bed")) # extract words with bed
bath_idx <- which(str_detect(row_words, "bath|baths")) # extract words with bath
# Check if bedroom indexes are found and populate newly created row
if (length(bedroom_idx) > 0){
ab_listing_0623[i, "NrOfBedrooms"] <- as.numeric(row_words[max(1, bedroom_idx - 1)])
# we know, that the index of number is one less than the index of the word that we
# are looking for.
} else {
ab_listing_0623[i, "NrOfBedrooms"] <- NA
}
# Check if beds indexes are found and populate newly created row
if (length(beds_idx) > 0){
ab_listing_0623[i, "NrOfBeds"] <- as.numeric(row_words[max(1, beds_idx - 1)])
# we know, that the index of number is one less than the index of the word that we
# are looking for.
} else {
ab_listing_0623[i, "NrOfBeds"] <- NA
}
# Check if bath indexes are found and populate newly created row
if (length(bath_idx) > 0){
ab_listing_0623[i, "NrOfBaths"] <- as.numeric(row_words[max(1, bath_idx - 1)])
# we know, that the index of number is one less than the index of the word that we
# are looking for.
} else {
ab_listing_0623[i, "NrOfBaths"] <- NA
}
}
## id name host_id
## 1 73282 Rental unit in Zurich · ★4.78 · 1 bedroom · 1 bed · 1 bath 377532
## host_name neighbourhood_group neighbourhood latitude longitude
## 1 Simona Kreis 3 Sihlfeld 47.37167 8.51948
## room_type price minimum_nights number_of_reviews last_review
## 1 Entire home/apt 100 1125 49 2019-04-27
## reviews_per_month calculated_host_listings_count availability_365
## 1 0.36 1 358
## number_of_reviews_ltm license NrOfBedrooms NrOfBeds NrOfBaths
## 1 0 NA 1 1 1
For the second part of chapter of our choice, we decided to create interactive map with Leaflet and Shiny packages. Our data sets consist of complex layers of data, which is easier to explore in interactive manner. This small application made with help of Shiny package. Application could be found in the 4_shinyApp folder separate from this document. Application allows the user to compare various information layers of Airbnb listings, burglary rate and average rent prices in each district or between them. Layout of application is chosen to help user to understand the data better.